In these exercises we will see the power of the libraries ggplot2 and plotly to make sense of statistical data. The goal is to reproduce the moving chart that you can see in this video from Hans Rosling (I invite you to watch his other videos; they are quite enlightening and inspiring):
For this, we will need to gather the data:
The first thing to do is to load and regroup all these datasets into a single one.
Load the tidyverse library and, using read_csv(), load the 4 datasets in 4 separate data.frames called children, income, pop and religion.
To reproduce the chart on the video, we need to determine the dominant religion in each country. In the religion dataset, add a column Religion that will give the name of the dominant religion for each country. For this, you might want to use this method that returns the name of the column containing the maximum of each row of a data.frame:
## V1 V2 V3
## 1 2 7 9
## 2 8 3 6
## 3 1 5 4
## [1] "V3" "V1" "V2"
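The printed output above is consistent with a call of the following kind. This is only a sketch, not necessarily the author's exact code: the toy data frame DF is rebuilt from the printed values, and applying which.max() row by row is one base R way to obtain the column names. The second half applies the same idea to a hypothetical miniature religion table whose column names and percentages are invented for illustration.

```r
# Toy data frame rebuilt from the values printed above
DF <- data.frame(V1 = c(2, 8, 1),
                 V2 = c(7, 3, 5),
                 V3 = c(9, 6, 4))

# For each row, which.max() gives the position of the largest value;
# that position is then used to pick the matching column name
max_cols <- colnames(DF)[apply(DF, 1, which.max)]
max_cols

# Same idea on a made-up religion table (names and numbers are placeholders)
religion_toy <- data.frame(Country    = c("A", "B"),
                           Christians = c(60, 10),
                           Muslims    = c(30, 85),
                           Hindus     = c(10, 5))
props <- religion_toy[, -1]  # keep only the numeric proportion columns
religion_toy$Religion <- colnames(props)[apply(props, 1, which.max)]
religion_toy
```

Dropping the Country column before calling apply() matters: with a character column included, apply() would convert every value to text and the comparison would no longer be numeric.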
Using pivot_longer(), make all datasets tidy.
children should now contain 3 columns: Country, Year and Fertility.
income should now contain 3 columns: Country, Year and Income.
pop should now contain 3 columns: Country, Year and Population.
We will only consider data from 1800 to 2018. Example of syntax using the pipe operator %>%:
## # A tibble: 2 x 5
## name `2010` `2011` `2012` `2014`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Kevin 10 11 12 123
## 2 Jane 122 56 23 4
DF %>%
  select(name, '2010':'2012') %>%   # keep only the years 2010 to 2012
  pivot_longer(cols=-name,          # pivot every column except name
               names_to="Year",
               values_to="Score",
               names_transform=list(Year = as.numeric))
## # A tibble: 6 x 3
## name Year Score
## <chr> <dbl> <dbl>
## 1 Kevin 2010 10
## 2 Kevin 2011 11
## 3 Kevin 2012 12
## 4 Jane 2010 122
## 5 Jane 2011 56
## 6 Jane 2012 23
The line names_transform=list(Year = as.numeric) is here to convert the character year values to numerical values.
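Applied to the exercise, the same pattern could look like the sketch below. The wide table children_wide and its values are made-up placeholders, and the name of the first column (Country) is an assumption about the raw file. Here the 1800 to 2018 restriction is done with filter() after pivoting, which is an alternative to selecting the year columns beforehand.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
children_wide <- data.frame(Country = c("A", "B"),
                            `1799`  = c(4.5, 5.2),
                            `1800`  = c(4.4, 5.1),
                            `1801`  = c(4.4, 5.1),
                            check.names = FALSE)  # keep numeric column names as-is

children_toy <- children_wide %>%
  pivot_longer(cols = -Country,
               names_to = "Year",
               values_to = "Fertility",
               names_transform = list(Year = as.numeric)) %>%
  filter(Year >= 1800, Year <= 2018)  # keep only the years of interest

children_toy
```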
Merge the 4 datasets into a single one called dat, containing the columns Country, Year, Population, Religion, Fertility and Income. Look into the inner_join() function of the dplyr library (which is part of the tidyverse library). For the religion dataset, we will consider that the proportions of 2010 are representative of all times.
You should end up with a dataset like this one:
## # A tibble: 37,887 x 6
## Country Year Fertility Income Population Religion
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Afghanistan 1800 7 603 3280000 Muslims
## 2 Afghanistan 1801 7 603 3280000 Muslims
## 3 Afghanistan 1802 7 603 3280000 Muslims
## 4 Afghanistan 1803 7 603 3280000 Muslims
## 5 Afghanistan 1804 7 603 3280000 Muslims
## 6 Afghanistan 1805 7 603 3280000 Muslims
## 7 Afghanistan 1806 7 603 3280000 Muslims
## 8 Afghanistan 1807 7 603 3280000 Muslims
## 9 Afghanistan 1808 7 603 3280000 Muslims
## 10 Afghanistan 1809 7 603 3280000 Muslims
## # … with 37,877 more rows
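One way to chain the joins is sketched below on tiny made-up tables (all values are placeholders, and the *_toy names are hypothetical). The three yearly tables are joined on both Country and Year, while the religion table, which has a single row per country, is joined on Country alone so that its value is repeated for every year.

```r
library(dplyr)

# Tiny placeholder tables standing in for the tidy datasets
children_toy <- data.frame(Country = c("A", "A", "B"),
                           Year = c(1800, 1801, 1800),
                           Fertility = c(7, 7, 5))
income_toy   <- data.frame(Country = c("A", "A", "B"),
                           Year = c(1800, 1801, 1800),
                           Income = c(600, 610, 1200))
pop_toy      <- data.frame(Country = c("A", "A", "B"),
                           Year = c(1800, 1801, 1800),
                           Population = c(3e6, 3.1e6, 2e7))
religion_toy <- data.frame(Country = c("A", "B"),
                           Religion = c("Muslims", "Christians"))

dat_toy <- children_toy %>%
  inner_join(income_toy, by = c("Country", "Year")) %>%  # match on country and year
  inner_join(pop_toy, by = c("Country", "Year")) %>%
  inner_join(religion_toy, by = "Country")               # one religion per country

dat_toy
```

Because inner_join() keeps only the rows whose keys exist in both tables, countries or years missing from any one dataset are dropped from the result.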
Now that our dataset is ready, let’s plot it.
Load the library ggplot2 and set the global theme to theme_bw() using theme_set().
Create a subset of dat for your country of origin. For me, it will be dat_france.
Plot the evolution of the income per capita and the number of children per woman as a function of the years, and make it look like this (notice the kinks during the two world wars):
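One possible layout is sketched below: the two indicators are stacked into a single column with pivot_longer() and drawn in separate panels with facet_wrap(). The data frame dat_one and its values are placeholders for a single-country subset, and stacked panels are only one option among several.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

theme_set(theme_bw())  # global theme for all following plots

# Placeholder subset for one country (values are made up)
dat_one <- data.frame(Year = 1900:1905,
                      Income = c(5000, 5100, 5150, 5300, 5250, 5400),
                      Fertility = c(2.8, 2.8, 2.7, 2.7, 2.6, 2.6))

p_one <- dat_one %>%
  pivot_longer(cols = c(Income, Fertility),
               names_to = "Indicator",
               values_to = "Value") %>%
  ggplot(aes(x = Year, y = Value)) +
  geom_line() +
  # one panel per indicator, each with its own y scale
  facet_wrap(~Indicator, ncol = 1, scales = "free_y")

p_one
```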
Create a subset of dat containing the data for your country plus all the neighboring countries (if you come from an island, the nearest countries…). For me, dat_france_region will contain data from France, Spain, Italy, Switzerland, Germany, Luxembourg and Belgium.
Plot income and fertility again as a function of the years, but map the color to the country and the point size to its population:
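A minimal sketch of these mappings, shown for income only, could look as follows. The data frame dat_region and its values are placeholders for a multi-country subset.

```r
library(ggplot2)

# Placeholder subset with two countries (values are made up)
dat_region <- data.frame(Country = rep(c("A", "B"), each = 3),
                         Year = rep(1900:1902, times = 2),
                         Income = c(5000, 5100, 5200, 4000, 4100, 4150),
                         Population = c(4.0e7, 4.1e7, 4.2e7, 1.8e7, 1.9e7, 1.9e7))

p_region <- ggplot(dat_region,
                   aes(x = Year, y = Income,
                       color = Country,      # one color per country
                       size = Population)) + # point area follows population
  geom_point()

p_region
```

The same aesthetics can be reused with y = Fertility for the second indicator.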
Load the library plotly and make the previous graphs interactive. You can make an interactive graph by calling ggplotly(), like this:
library(plotly)
P <- ggplot(data = dat_france, aes(x=Population, y=Income))+
  geom_point()
ggplotly(P) # adding dynamicTicks=TRUE allows redrawing ticks when zooming in
plotly can also animate a graph: specify frame = in the chart’s aesthetics. So now, make the graph of the video! (You can also add the aesthetic id=Country to show the country name in the popup when hovering on a point.)
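The sketch below shows how the frame aesthetic could be used on a tiny made-up dataset. It assumes that the chart plots fertility against income with population as point size and religion as color, and it uses the ids aesthetic in the way plotly's own animation examples do; all values are placeholders.

```r
library(ggplot2)
library(plotly)

# Placeholder data: two countries over three years (values are made up)
dat_anim <- data.frame(Country = rep(c("A", "B"), each = 3),
                       Year = rep(1900:1902, times = 2),
                       Income = c(5000, 5100, 5200, 800, 820, 850),
                       Fertility = c(2.8, 2.7, 2.6, 6.9, 6.8, 6.8),
                       Population = c(4.0e7, 4.1e7, 4.2e7, 5.0e6, 5.1e6, 5.2e6),
                       Religion = rep(c("Christians", "Muslims"), each = 3))

# ggplot2 warns that frame and ids are unknown aesthetics;
# they are still passed on to ggplotly(), which uses them for the animation
p_anim <- ggplot(dat_anim, aes(x = Income, y = Fertility, color = Religion)) +
  geom_point(aes(size = Population, frame = Year, ids = Country)) +
  scale_x_log10()  # incomes span orders of magnitude, so use a log scale

fig <- ggplotly(p_anim)  # adds a play button and a Year slider
fig
```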